Appendix B — Merging data frames

Sometimes information relating to the same subjects or observations might be stored in two separate data frames. When this is the case it is easy to combine two data frames using the merge() function. merge() takes the following arguments:

Only the arguments x = and y = are required. When any of of the by. = arguments are left out of the function, R will automatically look for columns which share the same name in the two data sets. When any of the all. = arguments are left out, they default to FALSE, so only complete cases are kept in the final merged data frame.

The file treatment.csv contains information on whether patients in a study testing a new treatment for high cholesterol were given the new drug or a placebo drug with no effect. Some of the patients in this new study are subjects from the chol data frame. We can read in the file treatment.csv and merge it with chol_full in order to see all the information available on a subject.

To start with, we need to read in treatment.csv and save this as a data frame called treatment.

treatment <- read.csv(file = "treatment.csv")

Then we can merge chol_full and treatment into a single data frame called patients using the following code.

patients <- merge(x = chol_full, y = treatment, by.x = "id",
                  by.y = "patient_id", all = TRUE)
head(patients[, -c(1:3)])
  trig age gender     smoke weight height      bmi treatment
1  120  48   male   current  99.02   1.70 34.26298      <NA>
2   NA  42   male      <NA>  73.99   1.84 21.85432 Treatment
3   NA  NA   <NA>      <NA>     NA     NA       NA Treatment
4   67  48 female        no  74.90   1.61 28.89549 Treatment
5   NA  NA   <NA>      <NA>     NA     NA       NA Treatment
6  139  30 female ex-smoker  86.25   1.83 25.75473   Placebo

Because the column showing the patient ID has a different name in chol_full and treatment, we have had to specify what it is called in each data frame here using by.x = and by.y = (make sure to check the contents of your data frames to notice things like this!). The argument all = TRUE means that we are keeping all information from both data frames, regardless of whether a patient only appears in chol_full or only in treatment. This is why in the excerpt of patients above, there are rows where the value for all variables except treatment are NA.

Task

The file class.csv contains information on the average primary class size in the years 2016 - 2022. You can read this file into R and save it as a data frame called class using the following code.

class <- read.csv(file = "class.csv")

Merge the information from the data frames education and class together into a new data frame called primary, showing all variables from education and the average class size for primary schools only. Look carefully at which row names these two data frames have in common.

primary <-(x =, y =,
    by = c(), all.y =`)

For more information on merging data set, see Section 1.11.4 Merging Data Frames of Probability and Statistics with R.